The bioinformatics analysis pipeline nfcore/ampliseq is used for amplicon sequencing, supporting denoising of any amplicon and supports a variety of taxonomic databases for taxonomic assignment including 16S, ITS, CO1 and 18S.
Pipeline input was saved in folder input.
Sequencing data was provided in the samplesheet file
Samplesheet.tsv that is displayed below:
Metadata associated with the sequencing data was provided in
PAN_Metadata.tsv and is displayed below:
FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. The sequence quality was checked using FastQC and resulting data was aggregated using the FastQC module of MultiQC. For more quality controls and per sample quality checks you can check the full MultiQC report, which can be found in multiqc/multiqc_report.html.
Additional quality filtering can improve sequence recovery. Often it is advised trimming the last few nucleotides to avoid less well-controlled errors that can arise there. Forward reads were trimmed at 275 bp and reverse reads were trimmed at 265 bp, reads shorter than this were discarded. Reads with more than 2 expected errors were discarded. Read counts passing the filter are shown in section ‘Read counts per sample’ column ‘filtered’.
Quality profiles:
Forward (left) and reverse (right) read quality stats for incoming data:
Forward (left) and reverse (right) read quality stats for preprocessed data:
Overall read quality profiles are displayed as heat map of the
frequency of each quality score at each base position. The mean quality
score at each position is shown by the green line, and the quartiles of
the quality score distribution by the orange lines. The red line shows
the scaled proportion of reads that extend to at least that position.
Original plots can be found in folder dada2/QC/ with names that end in
_qual_stats.pdf.
DADA2 performs fast and accurate sample inference from amplicon data with single-nucleotide resolution. It infers exact amplicon sequence variants (ASVs) from amplicon data with fewer false positives than many other methods while maintaining high sensitivity.
DADA2 reduces sequence errors and dereplicates sequences by quality filtering, denoising, read pair merging (for paired end Illumina reads only) and PCR chimera removal.
Read error correction was performed using estimated error rates, visualized below. Error rates for forward reads are at the left side and reverse reads are at the right side.
Estimated error rates are displayed for each possible transition. The
black line shows the estimated error rates after convergence of the
machine-learning algorithm. The red line shows the error rates expected
under the nominal definition of the Q-score. The estimated error rates
(black line) should be a good fit to the observed rates (points), and
the error rates should drop with increased quality. Original plots can
be found in folder dada2/QC/ with names that
end in .err.pdf.
Tracking read numbers through DADA2 processing steps for each sample. The following table shows the read numbers after each processing stage. Processing stages are: input - read pairs into DADA2, filtered - read pairs passed quality filtering, denoisedF - forward reads after denoising, denoisedR - reverse reads after denoising, merged - successfully merged read pairs, nonchim - read pairs in non-chimeric sequences (final ASVs).
Samples with unusual low reads numbers relative to the number of expected ASVs should be treated cautiously, because the abundance estimate will be very granular and might vary strongly between (theoretical) replicates due to high impact of stochasticity.
Following, the numbers of the table above are shown in stacked barcharts as percentage of DADA2 input reads. Stacked barchart of read pair numbers (denoisedF & denoisedR halfed, because each pair is split) per sample and processing stage (display 10 samples of each lowest and highest percentage of reads analysed, of 117 samples):
Between 29.81% and 44.06% reads per sample (average 36.8%) were retained for analysis within DADA2 steps.
The proportion of lost reads per processing stage and sample should not be too high, totalling typically <50%. Samples that are very different in lost reads (per stage) to the majority of samples must be compared with caution, because an unusual problem (e.g. during nucleotide extraction, library preparation, or sequencing) could have occurred that might add bias to the analysis.
Finally, 3880 amplicon sequence variants (ASVs) were obtained across
all samples. The ASVs can be found in dada2/ASV_seqs.fasta. And the
corresponding quantification of the ASVs across samples is in dada2/ASV_table.tsv. An extensive
table containing both was saved as dada2/DADA2_table.tsv. ASVs were
inferred for each sample independently.
Barrnap classifies the ASVs into the origin domain (including mitochondrial origin).
Barrnap classified 3868 ( 99.69 %) ASVs as most similar to Bacteria,
0 ( 0 %) ASVs to Archea, 0 ( 0 %) ASVs to Mitochondria, 0 ( 0 %) ASVs to
Eukaryotes, and 12 ( 0.31 %) were below similarity threshold to any
kingdom.
rRNA classification results can be found in folder barrnap.
The taxonomic classification was performed by DADA2 using the
database: Silva 138.1 prokaryotic SSU. More details about
the reference taxonomy database can be found in the ‘Methods section’.
DADA2 classified 99.95 % ASVs at Kingdom level, 99.25 % ASVs at Phylum level, 98.27 % ASVs at Class level, 97.06 % ASVs at Order level, 89.23 % ASVs at Family level, 58.71 % ASVs at Genus level, 19.41 % ASVs at Species level, 0.39 % ASVs at Species_exact level.
DADA2 taxonomy assignments can be found in folder dada2 in files ASV_tax_*.tsv.
Files that were input to QIIME2 can be found in folder qiime2/input/. Results of taxonomic classification of DADA2 was used in all following analysis, see in the above sections.
Unwanted taxa are often off-targets generated in PCR with primers
that are not perfectly specific for the target DNA. For 16S rRNA
sequencing mitrochondria and chloroplast sequences are typically removed
because these are frequent unwanted non-bacteria PCR products. ASVs were
removed when the taxonomic string contained any of
mitochondria,chloroplast (comma separated). Consequently,
3880 ASVs were reduced by 0 ( 0 %) to 3880 . The following table shows
read counts for each sample before and after filtering:
Tables with read count numbers and filtered abundance tables are in folder qiime2/abundance_tables.
The abundance tables are the final data for further downstream analysis and visualisations. The tables are based on the computed ASVs and taxonomic classification, but after removal of unwanted taxa. Folder qiime2/abundance_tables contains tap-separated files (.tsv) that can be opened by any spreadsheet software.
Absolute abundance tables produced by the previous steps contain count data, but the compositional nature of 16S rRNA amplicon sequencing requires sequencing depth normalisation. This step computes relative abundance tables using TSS (Total Sum Scaling normalisation) for various taxonomic levels and detailed tables for all ASVs with taxonomic classification, sequence and relative abundance for each sample. Typically used for in depth investigation of taxa abundances. Folder qiime2/rel_abundance_tables contains tap-separated files (.tsv) that can be opened by any spreadsheet software.
Interactive abundance plot that aids exploratory browsing the discovered taxa and their abundance in samples and allows sorting for associated meta data. Folder qiime2/barplot contains barplots, click qiime2/barplot/index.html to open it in your web browser.
Produces rarefaction plots for several alpha diversity indices, and is primarily used to determine if the richness of the samples has been fully observed or sequenced. If the slope of the curves does not level out and the lines do not become horizontal, this might be because the sequencing depth was too low to observe all diversity or that sequencing error artificially increases sequence diversity and causes false discoveries.
Folder qiime2/alpha-rarefaction contains the data, click qiime2/alpha-rarefaction/index.html to open it in your web browser.
Diversity measures summarize important sample features (alpha diversity) or differences between samples (beta diversity). Diversity calculations are based on sub-sampled data rarefied to 9201 counts.
Alpha diversity measures the species diversity within samples. Please note that ASVs were inferred for each sample independently, that can make alpha diversity indices a poor estimate of true diversity. This step calculates alpha diversity using various methods and performs pairwise comparisons of groups of samples. It is based on a phylogenetic tree of all ASV sequences. Folder qiime2/diversity/alpha_diversity contains the alpha-diversity data:
Alpha diversity is considered not trustworthy when it correlates positively with sequencing depth. Spearman’s rank correlation was calculated for total counts per sample after all filtering steps (in folder qiime2/abundance_tables) with alpha diversity measures. Significant positive correlation was found for shannon_vector, faith_pd_vector, observed_features_vector:
Scatter plots with linear regression line (blue) with 95% confidence interval (gray shaded area).
Beta diversity measures the species community differences between samples. This step calculates beta diversity distances using various methods and performs pairwise comparisons of groups of samples. Additionally, principle coordinates analysis (PCoA) plots are produced that can be visualized with Emperor in your default browser without the need for installation. These calculations are based on a phylogenetic tree of all ASV sequences. Folder qiime2/diversity/beta_diversity contains the beta-diverity data:
Statistics on differences between specific metadata groups that can
be found in folder qiime2/diversity/beta_diversity/.
Each significance test result is in its separate folder following the
scheme {method}_distance_matrix-{treatment}:
qiime2/diversity/beta_diversity/bray_curtis_distance_matrix-Injury/index.html
qiime2/diversity/beta_diversity/bray_curtis_distance_matrix-Sex/index.html
qiime2/diversity/beta_diversity/bray_curtis_distance_matrix-Timepoint/index.html
qiime2/diversity/beta_diversity/bray_curtis_distance_matrix-Treatment/index.html
qiime2/diversity/beta_diversity/jaccard_distance_matrix-Injury/index.html
qiime2/diversity/beta_diversity/jaccard_distance_matrix-Sex/index.html
qiime2/diversity/beta_diversity/jaccard_distance_matrix-Timepoint/index.html
qiime2/diversity/beta_diversity/jaccard_distance_matrix-Treatment/index.html
qiime2/diversity/beta_diversity/unweighted_unifrac_distance_matrix-Injury/index.html
qiime2/diversity/beta_diversity/unweighted_unifrac_distance_matrix-Sex/index.html
qiime2/diversity/beta_diversity/unweighted_unifrac_distance_matrix-Timepoint/index.html
qiime2/diversity/beta_diversity/unweighted_unifrac_distance_matrix-Treatment/index.html
qiime2/diversity/beta_diversity/weighted_unifrac_distance_matrix-Injury/index.html
qiime2/diversity/beta_diversity/weighted_unifrac_distance_matrix-Sex/index.html
qiime2/diversity/beta_diversity/weighted_unifrac_distance_matrix-Timepoint/index.html
qiime2/diversity/beta_diversity/weighted_unifrac_distance_matrix-Treatment/index.html
Analysis of Composition of Microbiomes (ANCOM) is applied to identify features that are differentially abundant across sample groups. A key assumption made by ANCOM is that few taxa (less than about 25%) will be differentially abundant between groups otherwise the method will be inaccurate. Comparisons between groups of samples is performed for specific metadata that can be found in folder qiime2/ancom/.
Test results are in separate folders following the scheme
Category-{treatment}-{taxonomic level}:
Phyloseq is a popular R package to analyse and visualize microbiom data. The produced RDS files contain phyloseq objects and can be loaded directely into R and phyloseq. The objects contain an ASV abundance table and a taxonomy table. If available, metadata and phylogenetic tree will also be included in the phyloseq object. The files can be found in folder phyloseq.
Data was processed using nf-core/ampliseq version 2.8.0 (doi: 10.5281/zenodo.1493841) (Straub et al., 2020) of the nf-core collection of workflows (Ewels et al., 2020), utilising reproducible software environments from the Bioconda (Grüning et al., 2018) and Biocontainers (da Veiga Leprevost et al., 2017) projects.
Data quality was evaluated with FastQC (Andrews, 2010) and summarized with MultiQC (Ewels et al., 2016).
Sequences were processed sample-wise (independent) with DADA2 (Callahan et al., 2016) to eliminate PhiX contamination, trim reads (forward reads at 275 bp and reverse reads at 265 bp, reads shorter than this were discarded), discard reads with > 2 expected errors, correct errors, merge read pairs, and remove polymerase chain reaction (PCR) chimeras; ultimately, 3880 amplicon sequencing variants (ASVs) were obtained across all samples. Between 29.81% and 44.06% reads per sample (average 36.8%) were retained. The ASV count table contained in total 2337653 counts, at least 9201 and at most 34821 per sample (average 19980).
Taxonomic classification was performed by DADA2 and the database
‘Silva 138.1 prokaryotic SSU’
(Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013 Jan;41(Database issue):D590-6. doi: 10.1093/nar/gks1219. Epub 2012 Nov 28. PMID: 23193283; PMCID: PMC3531112.).
ASV sequences, abundance and DADA2 taxonomic assignments were loaded into QIIME2 (Bolyen et al., 2019).
Within QIIME2, the final microbial community data was visualized in a barplot, evaluated for sufficient sequencing depth with alpha rarefaction curves, investigated for alpha (within-sample) and beta (between-sample) diversity after rarefaction to 9201 counts, used to find differential abundant taxa with ANCOM (Mandal et al., 2015).
WARNING This methods section is lacking software versions, these can be found in MultiQC’s report section Software Versions or in folder pipeline_info file
software_versions.yml.
Taxonomic classification by DADA2:
database: Silva 138.1 prokaryotic SSU
files:
[https://zenodo.org/record/4587955/files/silva_nr99_v138.1_wSpecies_train_set.fa.gz, https://zenodo.org/record/4587955/files/silva_species_assignment_v138.1.fa.gz]
citation:
Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013 Jan;41(Database issue):D590-6. doi: 10.1093/nar/gks1219. Epub 2012 Nov 28. PMID: 23193283; PMCID: PMC3531112.
MultiQC summarized computational methods in multiqc/multiqc_report.html. The proposed short methods description can be found in MultiQC’s Methods Description, versions of software collected at runtime in MultiQC’s Software Versions, and a summary of non-default parameter in MultiQC’s Workflow Summary.
Technical information to the pipeline run are collected in folder pipeline_info, including software versions
collected at runtime in file software_versions.yml (can be
viewed with a text editor), all parameter settings in file
params_{date}_{time}.json (can be viewed with a text
editor), execution report in file
execution_report_{date}_{time}.html, execution trace in
file execution_trace_{date}_{time}.txt, execution timeline
in file execution_timelime_{date}_{time}.html, and pipeline
direct acyclic graph (DAG) in file
pipeline_dag_{date}_{time}.html.
This report (file summary_report.html) is located in
folder summary_report of the original pipeline results
folder. In this file, all links to files and folders are relative,
therefore hyperlinks will only work when the report is at its original
place in the pipeline results folder. Plots specifically produced for
this report (if any) can be also found in folder summary_report.
A comprehensive read count report throughout the pipeline can be
found in the base results folder in file
overall_summary.tsv.
Please cite the pipeline publication and any software tools used by the pipeline (see citations) when you use any of the pipeline results in your study.